Structures from Perspective 2D Views
Abstract

Part  1  of  this  paper  investigates  the  differences  —  conceptually  and  algorithmically  —  between  affine  and 
projective  frameworks  for  the  tasks  of  visual  recognition  and  reconstruction  from  perspective  views.  It 
is shown that an affine invariant exists between any view and a fixed view chosen as a reference view.

This  implies  that  for  tasks  for  which  a  reference  view  can  be  chosen,  such  as  in  alignment  schemes  for 
visual recognition, projective invariants are not really necessary. The projective extension is then derived,
showing  that  it  is  necessary  only  for  tasks  for  which  a  reference  view  is  not  available  —  such  as  happens 
when updating scene structure from a moving stereo rig. The geometric difference between the two proposed
invariants is that the affine invariant measures the relative deviation from a single reference plane, whereas
the projective invariant measures the relative deviation from two reference planes. The affine invariant can
be  computed  from  three  corresponding  points  and  a  fourth  point  for  setting  a  scale;  the  projective  invariant 
can  be  computed  from  four  corresponding  points  and  a  fifth  point  for  setting  a  scale.  Both  the  affine  and 
projective  invariants  are  shown  to  be  recovered  by  remarkably  simple  and  linear  methods. 

In Part II we use the affine invariant to derive new algebraic connections between perspective views. It
is shown that three perspective views of an object are connected by certain algebraic functions of image
coordinates alone (no structure or camera geometry needs to be involved). In the general case, three views
satisfy a trilinear function of image coordinates. In the case where two of the views are orthographic and the
third is perspective, the function reduces to a bilinear form. In the case where all three views are orthographic,
the function reduces further to a linear form (the "linear combination of views" of [31]). These functions are
shown to be useful for recognition, among other applications.


Copyright  ©  Massachusetts  Institute  of  Technology,  1993 


This  report  describes  research  done  within  the  Center  for  Biological  and  Computational  Learning  in  the  Department  of  Brain 
and  Cognitive  Sciences,  and  at  the  Artificial  Intelligence  Laboratory.  Support  for  the  A.I.  Laboratory’s  artificial  intelligence 
research  is  provided  in  part  by  the  Advanced  Research  Projects  Agency  of  the  Department  of  Defense  under  Office  of  Naval 
Research  contract  N00014-91-J-4038.  Support  for  the  Center's  research  is  provided  in  part  by  ONR  contracts  N00014-91-J- 
1270  and  N00014-92-J-1879;  by  a  grant  from  the  National  Science  Foundation  under  contract  ASC-9217041  (funds  provided 
by this award include funds from ARPA provided under HPCC); and by a grant from the National Institutes of Health under
contract NIH 2-S07-RR07047-26. Additional support is provided by the North Atlantic Treaty Organization, ATR Audio and
Visual Perception Research Laboratories, Mitsubishi Electric Corporation, Siemens AG, and Sumitomo Metal Industries. A.
Shashua is supported by a McDonnell-Pew postdoctoral fellowship from the Department of Brain and Cognitive Sciences.



1  Introduction 

The geometric relation between objects (or scenes) in
the world and their images, taken from different viewing
positions by a pin-hole camera, has many subtleties and
nuances and has been the subject of research in computer
vision since its early days. Two major areas in computer
vision have been shown to benefit from an analytic treatment
of the 3D to 2D geometry: visual recognition and
reconstruction from multiple views (as a result of having
motion sequences or from stereopsis).

A recent approach with growing interest in the past
few years is based on the idea that non-metric information,
although weaker than the information provided by
depth maps and rigid camera geometries, is nonetheless
useful in the sense that the framework may provide simpler
algorithms, camera calibration is not required, more
freedom in picture-taking is allowed (such as taking
pictures of pictures of objects), and there is no need to
make a distinction between orthographic and perspective
projections. The list of contributions to this framework
includes (though it is not intended to be complete) [14, 26,
33, 34, 9, 20, 3, 4, 28, 29, 19, 31, 23, 5, 6, 18, 27, 13, 12];
most relevant to this paper is the work described in
[14, 4, 26, 28, 29].

This paper has two parts. In Part I we investigate
the intrinsic differences, conceptual and algorithmic,
between an affine framework for recognition/reconstruction
and a projective framework. Although
the distinction between affine and projective
spaces, and between affine and projective properties, is
perfectly clear from classic studies in projective and algebraic
geometries, as can be found in [8, 24, 25], it is less
clear how these concepts relate to reconstruction from
multiple views. In other words, given a set of views, under
what conditions can we expect to recover affine invariants?
What is the benefit from recovering projective
invariants over affine? Are there tasks, or methodologies,
for which an affine framework is completely sufficient?
What are the relations between the set of views generated
by a pin-hole camera and the set of all possible projections
$\mathcal{P}^3 \mapsto \mathcal{P}^2$ of a particular object? These are the
kinds of questions for which the current literature does
not provide satisfactory answers. For example, there is a
tendency in some of the work listed above, following the
influential work of [14], to associate the affine framework
with reconstruction/recognition from orthographic views
only. As will be shown later, the affine restriction need
not be coupled with the orthographic restriction on the
model of projection, provided we set one view fixed. In
other words, an uncalibrated pin-hole camera undergoing
general motion can indeed be modeled as an "affine
engine" provided we introduce a "reference view", i.e.,
all other views are matched against the reference view
for recovering invariants or for achieving recognition.

In  the  course  of  addressing  these  issues  we  derive  two 
new,  extremely  simple,  schemes  for  recovering  geometric 
invariants  —  one  affine  and  the  other  projective  —  which 
can  be  used  for  recognition  and  for  reconstruction. 

Some of the ideas presented in this part of the paper
follow the work of [14, 4, 26, 28, 29]. Section 3, on
affine reconstruction from two perspective views, follows
and expands upon the work of [26, 14, 4]. Section 4, on
projective reconstruction, follows and refines the results
presented in [28, 29].

In Part II of this paper we use the results established
in Part I (specifically those in Section 3) to address certain
algebraic aspects of the connections between multiple
views. Inspired by the work of [31], we address
the problem of establishing a direct connection between
views, expressed as functions of image coordinates alone,
which we call "algebraic functions of views". In addition
to linear functions of views, discovered by [31] and
applicable to orthographic views only, we show that three
perspective views are related by trilinear functions of
their coordinates, and by bilinear functions if two of the
three views are assumed orthographic, a case that will
be argued is relevant for purposes of recognition without
constraining the generality of the recognition process.
Part II ends with a discussion of possible applications
for algebraic functions, other than visual recognition.

2  Mathematical  Notations  and 
Preliminaries 

We consider object space to be the three-dimensional
projective space $\mathcal{P}^3$, and image space to be the
two-dimensional projective space $\mathcal{P}^2$. Within $\mathcal{P}^3$ we will be
considering the projective group of transformations and
the affine group. Below we describe basic definitions and
formalism related to projective and affine geometries;
more details can be found in [8, 24, 25].

2.1  Affine  and  Projective  Spaces 

Affine space over the field $K$ is simply the vector space
$K^n$, and is usually denoted as $\mathbb{A}^n$. Projective space $\mathcal{P}^n$
is the set of equivalence classes over the vector space
$K^{n+1}$. A point in $\mathcal{P}^n$ is usually written as a homogeneous
vector $(x_0, \ldots, x_n)$, which is an ordered set of $n + 1$
real or complex numbers, not all zero, whose ratios only
are to be regarded as significant. Two points $x$ and $y$
are equivalent, denoted by $x \cong y$, if $\mathbf{x} = \lambda\mathbf{y}$ for some
scalar $\lambda$. Likewise, two points are distinct if there is no
such scalar.

2.2  Representations 

The points in $\mathcal{P}^n$ admit a class of coordinate representations
$\mathcal{R}$ such that if $\mathcal{R}_0$ is any one allowable representation,
the whole class $\mathcal{R}$ consists of all those representations
that can be obtained from $\mathcal{R}_0$ by the action
of the group $GL_{n+1}$ of $(n+1) \times (n+1)$ non-singular
matrices. It follows that any one coordinate
representation is completely specified by its standard
simplex and its unit point. The standard simplex is
the set of $n + 1$ points which have the standard coordinates
$(1, 0, \ldots, 0), (0, 1, 0, \ldots, 0), \ldots, (0, 0, \ldots, 0, 1)$, and the
unit point is the point whose coordinates are $(1, 1, \ldots, 1)$.

It also follows that the coordinate transformation between
any two representations is completely determined
from $n + 2$ corresponding points in the two representations,
which give rise to a linear system of $(n+1)^2 - 1$
or $(n+1)^2$ equations (depending on whether we set an
arbitrary element of the matrix transform, or set one of
the scale factors of the corresponding points).
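As an illustration of how a representation is pinned down by a simplex and a unit point, the following sketch computes the collineation of $\mathcal{P}^2$ that carries four given points onto four others ($n + 2 = 4$ correspondences for $n = 2$). It is a minimal example, not the linear-system formulation mentioned above: it assumes no three of the four points are collinear, uses exact rational arithmetic, and all function names (`frame_matrix`, `collineation`, and so on) are hypothetical helpers introduced here.

```python
from fractions import Fraction

def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are a, b, c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - b[0] * (a[1] * c[2] - a[2] * c[1])
            + c[0] * (a[1] * b[2] - a[2] * b[1]))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def frame_matrix(p1, p2, p3, u):
    """Rows of M with M e_j ~ p_j (simplex) and M (1,1,1) = u (unit point)."""
    d = Fraction(det3(p1, p2, p3))  # nonzero when p1, p2, p3 are not collinear
    # Cramer's rule for lam1*p1 + lam2*p2 + lam3*p3 = u
    lam = [det3(u, p2, p3) / d, det3(p1, u, p3) / d, det3(p1, p2, u) / d]
    return [[lam[j] * p[i] for j, p in enumerate((p1, p2, p3))] for i in range(3)]

def inverse3(M):
    cols = [[M[i][j] for i in range(3)] for j in range(3)]
    d = Fraction(det3(*cols))
    # Row i of the inverse is the cross product of the other two columns over det
    rows = (cross(cols[1], cols[2]), cross(cols[2], cols[0]), cross(cols[0], cols[1]))
    return [[v / d for v in r] for r in rows]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(A, x):
    return [sum(A[i][k] * x[k] for k in range(3)) for i in range(3)]

def collineation(src, dst):
    """3x3 matrix H with H src_j ~ dst_j for four correspondences."""
    return matmul(frame_matrix(*dst), inverse3(frame_matrix(*src)))

def equivalent(x, y):
    """x ~ y as homogeneous vectors: their cross product vanishes."""
    return all(v == 0 for v in cross(x, y))

src = [(0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
dst = [(0, 0, 1), (2, 0, 1), (0, 3, 1), (3, 4, 1)]
H = collineation(src, dst)
```

The first three points play the role of the simplex and the fourth the role of the unit point, so the fourth correspondence is met exactly while the first three are met up to scale.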


2.3  Subspaces  and  Cross  Ratios 

A linear subspace $\Lambda \cong \mathcal{P}^k \subset \mathcal{P}^n$ is a hyperplane if $k = n - 1$,
is a line when $k = 1$, and otherwise is a $k$-plane.
There is a unique line in $\mathcal{P}^n$ through any two distinct
points. Any point $z$ on a line can be described as a linear
combination of two fixed points $x, y$ on the line, i.e.,
$z \cong x + ky$. Let $u \cong x + k'y$ be another point on the line
spanned by $x, y$; then the cross ratio of the four points is
simply $\alpha = k/k'$, which is invariant in all representations
$\mathcal{R}$. By permuting the four points on the line, the 24
possible cross ratios fall into six sets of four with values
$\alpha$, $1/\alpha$, $1 - \alpha$, $(\alpha - 1)/\alpha$, $\alpha/(\alpha - 1)$ and $1/(1 - \alpha)$.
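The cross ratio and its six permuted values can be checked numerically. The sketch below works on a line parameterized as $\mathcal{P}^1$ and uses a determinant form of the cross ratio, chosen here (as an assumption of this example) so that it reduces to $k/k'$ for $z = x + ky$, $u = x + k'y$; the matrix `M` is an arbitrary non-singular change of representation.

```python
from fractions import Fraction
from itertools import permutations

def det2(a, b):
    """2x2 determinant of two points of P^1 given as homogeneous pairs."""
    return a[0] * b[1] - a[1] * b[0]

def cross_ratio(x, y, z, u):
    """Cross ratio arranged so that z = x + k*y, u = x + k'*y gives k / k'."""
    return Fraction(det2(x, z) * det2(y, u), det2(x, u) * det2(y, z))

def transform(M, p):
    """Apply a 2x2 matrix M (tuple of rows) to a homogeneous pair p."""
    return (M[0][0] * p[0] + M[0][1] * p[1], M[1][0] * p[0] + M[1][1] * p[1])

x, y = (1, 0), (0, 1)
k, k_prime = 3, 5
z = (1, k)        # z = x + k * y
u = (1, k_prime)  # u = x + k' * y
points = (x, y, z, u)

alpha = cross_ratio(*points)

# A non-singular change of representation leaves the cross ratio unchanged.
M = ((2, 1), (1, 3))
alpha_after = cross_ratio(*(transform(M, p) for p in points))

# The 24 orderings of the four points produce only six distinct values.
values = {cross_ratio(*perm) for perm in permutations(points)}
expected = {alpha, 1 / alpha, 1 - alpha, (alpha - 1) / alpha,
            alpha / (alpha - 1), 1 / (1 - alpha)}
```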


2.4  Projections 

Let $\mathcal{P}^{n-1} \subset \mathcal{P}^n$ be some hyperplane, and $O \in \mathcal{P}^n$ a point
not lying on $\mathcal{P}^{n-1}$. If we like, we can choose the
representation such that $\mathcal{P}^{n-1}$ is given by $x_n = 0$ and
the point $O = (0, 0, \ldots, 0, 1)$. We can define a map

$$\pi_O : \mathcal{P}^n - \{O\} \rightarrow \mathcal{P}^{n-1}$$

by

$$\pi_O : P \mapsto \overline{OP} \cap \mathcal{P}^{n-1};$$

that is, sending a point $P \in \mathcal{P}^n$ other than $O$ to the point
of intersection of the line $OP$ with the hyperplane $\mathcal{P}^{n-1}$.
$\pi_O$ is the projection from the point $O$ to the hyperplane
$\mathcal{P}^{n-1}$, and the point $O$ is called the center of projection
(COP). In terms of coordinates $x$, this amounts to

$$\pi_O : (x_0, \ldots, x_n) \mapsto (x_0, \ldots, x_{n-1}).$$

As an example, the projection of 3D objects onto an
image plane is modeled by $x \mapsto Tx$, where $T$ is a $3 \times 4$
matrix, often called the camera transformation. The
set $S$ of all views of an object (ignoring problems of
self occlusion, i.e., assuming that all points are visible
from all viewpoints) is obtained by the group $GL_4$ of
$4 \times 4$ non-singular matrices applied to some arbitrary
representation of $\mathcal{P}^3$, and then dropping the coordinate
$x_3$.


2.5 The Affine Subgroup

Let $A_i \subset \mathcal{P}^n$ be the subset of points $(x_0, \ldots, x_n)$ with
$x_i \neq 0$. Then the ratios $\bar{x}_j = x_j/x_i$ are well defined and
are called affine or Euclidean coordinates on the projective
space, and $A_i$ is bijective to the affine space $\mathbb{A}^n$,
i.e., $A_i \cong \mathbb{A}^n$. The affine subgroup of $GL_{n+1}$ leaves
the hyperplane $x_i = 0$ invariant under all affine representations.
Any subgroup of $GL_{n+1}$ that leaves some
hyperplane invariant is an affine subgroup, and the invariant
hyperplane is called the ideal hyperplane. As an
example, a subgroup of $GL_4$ that leaves some plane invariant
is affine. It could be any plane, but if it is the
plane at infinity ($x_3 = 0$) then the mapping $\mathcal{P}^3 \mapsto \mathcal{P}^2$
is created by parallel projection, i.e., the COP is at infinity.
Since two lines are parallel if they meet on the
ideal hyperplane, when the ideal hyperplane is at
infinity, affine geometry takes its "intuitive" form of preserving
parallelism of lines and planes and preserving
ratios. The importance of the affine subgroups is that
there exist affine invariants that are not projective invariants.
Parallelism, the concept of a midpoint, area of
triangles, and classification of conics are examples of affine
properties that are not projective.


2.6  Epipoles 

Given two cameras with positions of their COP at
$O, O' \in \mathcal{P}^3$, respectively, the epipoles are at the intersection
of the line $OO'$ with both image planes. Recovering
the epipoles from point correspondences across two views
is remarkably simple but is notoriously sensitive to noise
in image measurements. For more details on recovering
epipoles see [4, 29, 2{', .j], and for comparative and error
analysis see [17, 22]. In Part I of this paper we assume
the epipoles are given; in Part II, where we make further
use of derivations made in Section 3, we show that for
purposes discussed there one can eliminate the epipoles
altogether.


2.7  Image  Coordinates 

Image space is $\mathcal{P}^2$. Since the image plane is finite, we can
assign, without loss of generality, the value 1 as the third
homogeneous coordinate to every image point. That is,
if $(x, y)$ are the observed image coordinates of some point
(with respect to some arbitrary origin, say the geometric
center of the image), then $p = (x, y, 1)$ denotes the
homogeneous coordinates of the image plane. Note that
by this notation we are not assuming that an observed
point in one image is always mapped onto an observed
(i.e., not at infinity) point in another view (that would
constitute an affine plane); all that we are relying
upon is that points at infinity are not observed anyway,
so we are allowed to assign the value 1 to all observed
points.

2.8  General  Notations 

Vectors are always column vectors, unless mentioned
otherwise. The transpose notation will be added only
when otherwise there is a chance for confusion. Vectors
will be in bold-face only in conjunction with a scalar, i.e.,
$\lambda\mathbf{x}$ stands for the scalar $\lambda$ scaling the vector $x$. Scalar
product will be noted by a center dot, i.e., $x \cdot y$, again
avoiding the transpose notation except when necessary.
Cross product will be denoted as usual, i.e., $x \times y$. The
cross product, viewed as an operator, can be used between
a vector $x$ and a $3 \times 3$ matrix $A$ as follows:

$$x \times A = \begin{bmatrix} x_2 a_3 - x_3 a_2 \\ x_3 a_1 - x_1 a_3 \\ x_1 a_2 - x_2 a_1 \end{bmatrix},$$

where $a_1, a_2, a_3$ are the row vectors of $A$, and $x = (x_1, x_2, x_3)$.
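A small sketch of this operator may help. It builds $x \times A$ row by row from the definition above and checks the identity $(x \times A)\,y = x \times (Ay)$, which follows directly from the definition but is stated here as our own remark rather than as part of the original text. The helper names and the sample values are arbitrary.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(A, y):
    return [sum(A[i][j] * y[j] for j in range(3)) for i in range(3)]

def cross_matrix(x, A):
    """x × A for a 3-vector x and a 3x3 matrix A given as a list of rows."""
    a1, a2, a3 = A
    x1, x2, x3 = x
    return [[x2 * a3[i] - x3 * a2[i] for i in range(3)],
            [x3 * a1[i] - x1 * a3[i] for i in range(3)],
            [x1 * a2[i] - x2 * a1[i] for i in range(3)]]

x = [1, 2, 3]
A = [[2, 0, 1], [1, 3, 0], [0, 1, 2]]
y = [4, -1, 2]

lhs = matvec(cross_matrix(x, A), y)  # (x × A) y
rhs = cross(x, matvec(A, y))         # x × (A y)
```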


Part  I 

3  Affine  Structure  and  Invariant  From 
Two  Perspective  Views 

The key idea underlying the derivations in this section is
to place the two camera centers as part of the reference
frame (simplex and unit point) of $\mathcal{P}^3$. Let $P_1, P_2, P_3$ be
three object points projecting onto corresponding points
$p_j, p'_j$, $j = 1, 2, 3$, in the two views. We assign the coordinates
$(1, 0, 0, 0)$, $(0, 1, 0, 0)$, $(0, 0, 1, 0)$ to $P_1, P_2, P_3$,
respectively. For later reference, the plane passing through
$P_1, P_2, P_3$ will be denoted by $\pi_1$. Let $O$ be the COP of
the first camera, and $O'$ the COP of the second camera.
We assign the coordinates $(0, 0, 0, 1)$, $(1, 1, 1, 1)$ to $O, O'$,
respectively (see Figure 1). This choice of representation
is always possible because the two cameras are part of
$\mathcal{P}^3$. By construction, the point of intersection of the line
$OO'$ with $\pi_1$ has the coordinates $(1, 1, 1, 0)$ (note that $\pi_1$
is the plane $x_3 = 0$, therefore the linear combination of
$O$ and $O'$ with $x_3 = 0$ must be a multiple of $(1, 1, 1, 0)$).

Let $P$ be some object point projecting onto $p, p'$. The
line $OP$ intersects $\pi_1$ at the point $(\alpha, \beta, \gamma, 0)$. The coordinates
$\alpha, \beta, \gamma$ can be recovered by projecting the image
plane onto $\pi_1$, as follows. Let $v, v'$ be the location of both
epipoles in the first and second view, respectively (see
Section 2.6). Given the epipoles $v$ and $v'$, we have by our
choice of coordinates that $p_1, p_2, p_3$ and $v$ are projectively
(in $\mathcal{P}^2$) mapped onto $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 =
(0, 0, 1)$ and $e_4 = (1, 1, 1)$, respectively. Therefore, there
exists a unique element $A_1 \in PGL_3$ ($3 \times 3$ matrix defined
up to a scale) that satisfies $A_1 p_j \cong e_j$, $j = 1, 2, 3$, and
$A_1 v = e_4$. Note that we have made a choice of scale by
setting $A_1 v$ to $e_4$; this is simply for convenience, as will
be clear later on. It follows that $A_1 p \cong (\alpha, \beta, \gamma)$.
Similarly, the line $O'P$ intersects $\pi_1$ at $(\alpha', \beta', \gamma', 0)$.
Let $A_2 \in PGL_3$ be defined by $A_2 p'_j \cong e_j$, $j = 1, 2, 3$, and
$A_2 v' = e_4$. It follows that $A_2 p' \cong (\alpha', \beta', \gamma')$. Since $P$
can be described as a linear combination of two points
along each of the lines $OP$ and $O'P$, we have the following
equation:

$$P \cong (\alpha, \beta, \gamma, 0)^\top + k\,(0, 0, 0, 1)^\top = \mu\,(\alpha', \beta', \gamma', 0)^\top + s\,(1, 1, 1, 1)^\top,$$

from which it immediately follows that $k = s$. We have
therefore, by the choice of putting both cameras on the
frame of reference, that the transformation in $\mathcal{P}^3$ is affine
(the plane $\pi_1$ is preserved). If we leave the first camera
fixed and move the second camera to a new position
(which must be a general position, i.e., $O' \notin \pi_1$), then the
transformation in $\mathcal{P}^3$ belongs to the same affine group.
Note that since only ratios of coordinates are significant
in $\mathcal{P}^n$, $k$ is determined up to a uniform scale, and any
point $P_o \notin \pi_1$ can be used to set a mutual scale for
all views, by setting an appropriate scale for $v'$, for
example. The value of $k$ can easily be determined as
follows: we have

$$\mu\,(\alpha', \beta', \gamma')^\top = (\alpha, \beta, \gamma)^\top - k\,(1, 1, 1)^\top,$$

that is, $\mu A_2 p' = A_1 p - k e_4$. Multiply both sides by $A_2^{-1}$, for which we get

$$\mu p' = Ap - kv', \qquad (1)$$

where $A = A_2^{-1} A_1$. Note that $A \in PGL_3$ is a
collineation between the two image planes, due to $\pi_1$,
determined by $p'_j \cong Ap_j$, $j = 1, 2, 3$, and $Av = v'$ (therefore,
it can be recovered directly without going through
$A_1, A_2$). Since $k$ is determined up to a uniform scale,
we need a fourth correspondence $p_o, p'_o$, and let $A$, or $v'$,
be scaled such that $p'_o \cong Ap_o - v'$. Then $k$ is an affine
invariant, which we will refer to as "affine depth". Furthermore,
$(x, y, 1, k)$ are the homogeneous coordinates
representation of $P$, and the $3 \times 4$ matrix $[A, -v']$ is a
camera transformation matrix between the two views.
Note that $k$ is invariant when computed against a reference
view (the first view in this derivation), the camera
transformation matrix does not depend only on the camera
displacement but also on the choice of three points, and
the camera is an "affine engine" if a reference view is
available. More details on theoretical aspects of this result
are provided in Section 3.2, but first we discuss its
algorithmic aspect.

3.1  Two  Algorithms:  Re-projection  and  Affine 
Reconstruction  from  Two  Perspective 
Views 

On the practical side, we have arrived at a remarkably
simple algorithm for affine reconstruction from two
perspective/orthographic views (with an uncalibrated camera),
and an algorithm for generating novel views of a
scene (re-projection). For reconstruction we follow these
steps:

1. Compute epipoles $v, v'$ (see Section 2.6).

2. Compute the matrix $A$ that satisfies $Ap_j \cong p'_j$,
$j = 1, 2, 3$, and $Av = v'$. This requires a solution of a
linear system of eight equations (see Appendices in
[19, 27, 28] for details).

3. Set the scale of $v'$ by using a fourth corresponding
pair $p_o, p'_o$ such that $p'_o \cong Ap_o - v'$.

4. For every corresponding pair $p, p'$ recover the affine
depth $k$ that satisfies $p' \cong Ap - kv'$. As a technical
note, $k$ can be recovered in a least-squares fashion
by using cross-products:

$$k = \frac{(p' \times v')^\top (p' \times Ap)}{\| p' \times v' \|^2}.$$
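Step 4 can be sketched in a few lines. The example below assumes that $A$ and $v'$ have already been computed and scaled as in steps 1 to 3; it generates second-view points from known depths through $p' \cong Ap - kv'$ and then recovers $k$ with the cross-product formula. The matrix, epipole, and sample points are made-up values for illustration only, and `second_view` and `affine_depth` are hypothetical helper names.

```python
from fractions import Fraction

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

def second_view(A, v2, p, k):
    """Synthesize p' ~ A p - k v', normalized so its third coordinate is 1."""
    q = [a - k * v for a, v in zip(matvec(A, p), v2)]
    return [Fraction(c) / q[2] for c in q]

def affine_depth(A, v2, p, p2):
    """Least-squares k with p' ~ A p - k v' (undefined if p' coincides with v')."""
    n = cross(p2, v2)
    return Fraction(dot(n, cross(p2, matvec(A, p)))) / dot(n, n)

# Assumed inputs: collineation A due to the reference plane and epipole v'.
A = [[2, 0, 1], [1, 3, 0], [0, 1, 2]]
v2 = [1, 2, 1]

# (image point in the reference view, true affine depth); k = 0 lies on the plane.
samples = [([1, 1, 1], Fraction(2)),
           ([2, -1, 1], Fraction(1, 2)),
           ([0, 3, 1], Fraction(0))]

recovered = [affine_depth(A, v2, p, second_view(A, v2, p, k)) for p, k in samples]
```

Because the formula only uses cross products with $p'$, the unknown scale $\mu$ of $p'$ drops out, which is why normalizing the third coordinate to 1 does not affect the recovered depth.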

Note that $k$ is invariant as long as we use the first view
as a reference view, i.e., compute $k$ between a reference
view $p$ and any other view. The invariance of $k$ can be
used to "re-project" the object onto any third view $p''$,
as follows. We observe:

$$p'' \cong Bp - kv'',$$

for some (unique up to a scale) matrix $B$ and epipole $v''$.
One can solve for $B$ and $v''$ by observing six corresponding
points between the first and third view. Each pair of
corresponding points $p_j, p''_j$ contributes two equations:

$$b_{31} x_j x''_j + b_{32} y_j x''_j - k_j v''_3 x''_j + x''_j = b_{11} x_j + b_{12} y_j + b_{13} - k_j v''_1,$$

$$b_{31} x_j y''_j + b_{32} y_j y''_j - k_j v''_3 y''_j + y''_j = b_{21} x_j + b_{22} y_j + b_{23} - k_j v''_2,$$

where $b_{33} = 1$ (this sets an arbitrary scale, because
the system of equations is homogeneous; of course
this excludes the case where $b_{33} = 0$, but in practice
this is not a problem; also, one can use principal component
analysis instead of setting the value of some chosen
element of $B$ or $v''$). The values of $k_j$ are found
from the correspondences $p_j, p'_j$, $j = 1, \ldots, 6$ (note that
$k_1 = k_2 = k_3 = 0$). Once $B, v''$ are recovered, we can
find the location of $p''_i$ for any seventh point $p_i$, by first
solving for $k_i$ from the equation $p'_i \cong Ap_i - k_i v'$, and then
substituting the result in the equation $p''_i \cong Bp_i - k_i v''$.
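The two equations per correspondence can be written as rows of a linear system in the eleven unknowns $(b_{11}, b_{12}, b_{13}, b_{21}, b_{22}, b_{23}, b_{31}, b_{32}, v''_1, v''_2, v''_3)$, with $b_{33} = 1$. The sketch below only builds the rows and checks that a known $B, v''$ satisfies them on synthetic data; it does not solve the system, which in practice would be done by least squares over six or more pairs. The unknown ordering, the helper names, and all numeric values are assumptions of this example.

```python
from fractions import Fraction

def matvec(M, x):
    return [sum(r * c for r, c in zip(row, x)) for row in M]

def third_view(B, v3, p, k):
    """Synthesize p'' ~ B p - k v'', normalized so its third coordinate is 1."""
    q = [b - k * v for b, v in zip(matvec(B, p), v3)]
    return [Fraction(c) / q[2] for c in q]

def equations(p, p3, k):
    """Two rows and right-hand sides for one pair p, p'' with affine depth k."""
    x, y, _ = p
    x3, y3, _ = p3
    rows = [[x, y, 1, 0, 0, 0, -x * x3, -y * x3, -k, 0, k * x3],
            [0, 0, 0, x, y, 1, -x * y3, -y * y3, 0, -k, k * y3]]
    return rows, [x3, y3]

# Ground truth used only to synthesize data (b33 = 1 as in the text).
B = [[2, 0, 1], [1, 3, 0], [0, 1, 1]]
v3 = [1, 2, 1]
unknowns = [2, 0, 1, 1, 3, 0, 0, 1, 1, 2, 1]

# Six pairs: the first three lie on the reference plane (k = 0).
samples = [([0, 0, 1], 0), ([1, 0, 1], 0), ([0, 1, 1], 0),
           ([1, 2, 1], 1), ([2, -1, 1], Fraction(1, 2)), ([3, 1, 1], -1)]

system_rows, system_rhs = [], []
for p, k in samples:
    rows, rhs = equations(p, third_view(B, v3, p, k), k)
    system_rows += rows
    system_rhs += rhs

residuals = [sum(a * u for a, u in zip(row, unknowns)) - t
             for row, t in zip(system_rows, system_rhs)]
```

Six pairs give twelve equations for eleven unknowns, matching the count stated above.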

3.2  Results  of  Theoretical  Nature 

Let $\psi_0 \in S$ be some view from the set of all possible
views, and let $p_1, p_2, p_3 \in \psi_0$ be non-collinear points
projected from some plane $\pi$. Also, let $S_\pi \subset S$ be the
subset of views for which the corresponding pairs of $p_j$,
$j = 1, 2, 3$, are non-collinear ($A$ is full rank). Note that
$S_\pi$ contains all views for which the COP is not on $\pi$. We
have the following result:

There exists an affine invariant between a reference view
$\psi_0$ and the set of views $S_\pi$.

The result implies that, within the framework of uncalibrated
cameras, there are certain tasks which are inherently
affine and, therefore, projective invariants are
not necessary and affine invariants are sufficient
(it is yet to be shown exactly when we need to recover
projective invariants; this is the subject of Section 4).
Consider for example the task of recognition within the
context of alignment [30, 11]. In the alignment approach,
two or more reference views (also called model views),
or a 3D model, are stored in memory and referred to
as a "model" of the object. During the recognition process,
a small number of corresponding points between
the reference views and the novel view are used for
"re-projecting" the object onto the novel viewing position
(as, for example, using the method described in the previous
section). Recognition is achieved if the re-projected
image is successfully matched against the input image.
This entails a sequential search over all possible models
until a match is found between the novel view and the
re-projected view using a particular model. The implication
of the result above is that since alignment uses
a fixed set of reference views of an object to perform
recognition, only affine machinery is really necessary
to perform re-projection. As will be shown in Section 4,
projective machinery requires more points and
slightly more computations (but see Section 9 for discussion
about practical considerations).

The manner in which affine-depth was derived gives
rise to a refinement on the general result that four corresponding
points and the epipoles are required for affine
reconstruction from two perspective views [4, 29]. Our
derivation shows that in addition to the epipoles, we
need only three points to recover affine structure up to
a uniform scale, and therefore the fourth point is needed
only for setting such a scale. To summarize:

In the case where the locations of the epipoles are known,
three corresponding points are sufficient for computing
the affine structure, up to a uniform but unknown scale,
for all other points in space projecting onto corresponding
points in both views.

We also have:

Affine shape can be described as the ratio of the distances of a point $P$
from a plane and from the COP, normalized by the same ratio for a
fixed point, taken with respect to the reference plane and the COP.

Therefore, affine-depth $k$ depends only on three points
(setting up a reference plane), the COP (of the reference
view) and a fourth point for setting a scale. This way
of describing structure relative to a reference plane is
very similar to what [14] suggested for reconstruction
from two orthographic views. The difference is that there
the fourth point played the role of both the COP and
of setting a scale. We will show next that the affine-depth
structure description derived here reduces exactly
to what [14] described in the orthographic case.

There are two ways to look at the orthographic case.
First, when both views are orthographic, the collineation
$A$ (in Equation 1) between the two images is an affine
transformation in $\mathcal{P}^2$, i.e., the third row of $A$ is $(0, 0, 1)$.
Therefore, $A$ can be computed from only three corresponding
points, $Ap_j \cong p'_j$, $j = 1, 2, 3$. Because both $O$
and $O'$ are at infinity, the epipole $v'$ is on the plane
$x_3 = 0$, i.e., $v'_3 = 0$, and as a result all epipolar lines
are parallel to each other. A fourth corresponding point
$p_o, p'_o$ can be used to determine both the direction of
epipolar lines and to set the scale for the affine depth of
all other points, as described in [11]. We see, therefore,
that the orthographic case is simply a particular case of
Equation 1. Alternatively, consider again the structure
description entailed by our derivation of affine depth. If
we denote the point of intersection of the line $OP$ with
$\pi_1$ by $\tilde{P}$, we have (see Figure 2)

$$k = \frac{\;\dfrac{P - \tilde{P}}{P - O}\;}{\;\dfrac{P_o - \tilde{P}_o}{P_o - O}\;}.$$

Let $O$ (the COP of the first camera) go to infinity, in
which case affine-depth approaches

$$k \rightarrow \frac{P - \tilde{P}}{P_o - \tilde{P}_o},$$

which is precisely the way shape was described in [14]
(see also [20, 27]). In the second view, if it is orthographic,
then the two trapezoids $P, \tilde{P}, p', Ap$ and
$P_o, \tilde{P}_o, p'_o, Ap_o$ are similar, and from similarity of trapezoids
we obtain

$$\frac{P - \tilde{P}}{P_o - \tilde{P}_o} = \frac{p' - Ap}{p'_o - Ap_o},$$

which, again, is the expression described in [14, 20]. Note
that affine-depth in the orthographic case does not depend
any more on $O$, and therefore remains fixed regardless
of what pair of views we choose; namely, a reference
view is not necessary any more. This leads to the following
result:

Let a subset of $S$ consist of the views created by means of
parallel projection, i.e., the plane $x_3 = 0$ is preserved.
Given four fixed reference points, affine-depth on $S$ is
reference-view-dependent, whereas affine-depth on the
parallel-projection subset is reference-view-independent.

Consider next the resulting camera transformation
matrix $[A, -v']$. The matrix $A$ depends on the choice of
three points and therefore does not depend only on the
camera displacement. This additional degree of freedom
is a direct result of our camera being uncalibrated, i.e.,
we are free to choose the internal camera parameters (focal
length, principal point, and image coordinates scale
factors) as we like. The matrix $A$ is unique, i.e., depends
only on camera displacement, if we know in advance that
the internal camera parameters remain fixed for all views
in $S_\pi$. For example, assume the camera is calibrated in the
usual manner, i.e., focal length is 1, principal point is at
$(0, 0, 1)$ in Euclidean coordinates, and image scale factors
are 1 (image plane is parallel to the $xy$ plane of the Euclidean
coordinate system). In that case $A$ is an orthogonal matrix
and can be recovered from two corresponding points
and the epipoles, by imposing the constraint that vector
magnitudes remain unchanged (each point provides
three equations). A third corresponding point can be
used to determine the reflection component (i.e., making
sure the determinant of $A$ is 1 rather than $-1$). More
details can be found in [27, 10]. Since in the uncalibrated
case $A$ is not unique, let $A_\pi$ denote the fact that $A$ is
the collineation induced by a plane $\pi$, and let $k_\pi$ denote
the fact that the affine-depth also depends on the choice
of $\pi$. We see, therefore, that there exists a family of
solutions for the camera transformation matrix and the
affine-depth as a function of $\pi$. This immediately implies
that a naive solution for $A, k$, given $v'$, from point correspondences
leads to a singular system of equations (even
if many points are used for a least-squares solution).

Given the epipole $v'$, the linear system of equations for solving for $A$ and $k_j$ of the equation

$$\rho_j p'_j = A p_j - k_j v',$$

from point correspondences $p_j, p'_j$ is singular, unless further constraints are introduced.

We see that equation counting alone is not sufficient for obtaining a unique solution, and therefore the knowledge that $A$ is a homography of a plane is critical for this task. For example, one can solve for $A$ and $k_j$ from many correspondences in a least-squares approach by first setting $k_j = 0$, $j = 1, 2, 3$ and $k_4 = 1$, otherwise the solution may not be unique.
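Once $A$ and $v'$ are fixed in this way, the affine depth of any further correspondence follows from a one-unknown linear problem. The following is a minimal sketch under stated assumptions, not the paper's implementation: the function name `affine_depth`, the helper functions and all numeric values are made up for illustration, and $A$, $v'$ are assumed to be already recovered and scaled. It eliminates $\rho$ from $\rho p' = Ap - kv'$ by crossing both sides with $p'$, which gives $p' \times Ap = k\,(p' \times v')$.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def affine_depth(A, v_prime, p, p_prime):
    """Solve rho * p' = A p - k v' for k (least squares over three equations).

    Crossing both sides with p' removes rho: p' x (A p) = k (p' x v').
    """
    lhs = cross(p_prime, matvec(A, p))
    rhs = cross(p_prime, v_prime)
    denom = dot(rhs, rhs)
    if denom == 0.0:
        raise ValueError("p' coincides with the epipole, k is undefined")
    return dot(lhs, rhs) / denom

# Synthetic check with made-up values (hypothetical A, v', p, k).
A = [[1.0, 0.1, 0.2], [0.0, 0.9, -0.3], [0.05, 0.0, 1.0]]
v_prime = [0.4, -0.2, 1.0]
p = [0.5, 0.25, 1.0]
k_true = 0.7
q = [a - k_true * b for a, b in zip(matvec(A, p), v_prime)]
p_prime = [c / q[2] for c in q]      # normalise so that p' = (x', y', 1)
k = affine_depth(A, v_prime, p, p_prime)
```

With noise-free data the three cross-product equations agree exactly; with measured correspondences the same expression gives their least-squares compromise.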

Finally, consider the "price" we are paying for an uncalibrated, affine framework. We can view this in two ways, somewhat orthogonal. First, if the scene is undergoing transformations, and the camera is fixed, then those transformations are affine in 3D, rather than rigid. For purposes of achieving visual recognition the price we are paying is that we might confuse two different objects that are affinely related. Second, because of the non-uniqueness of the camera transformation matrix it appears that the set of views is a superset of the set of views that could be created by a calibrated camera taking pictures of the object. The natural question is whether this superset can, nevertheless, be realized by a calibrated camera. In other words, if we have a calibrated camera (or we know that the internal camera parameters remain fixed for all views), then can we generate $S_\pi$, and if so how? This question was addressed first in [12] but assuming only orthographic views. A more general result is expressed in the following proposition:

Proposition 1 Given an arbitrary view $\psi_o \in S_\pi$ generated by a camera with COP at initial position $O$, then all other views $\psi \in S_\pi$ can be generated by a rigid motion of the camera frame from its initial position, if in addition to taking pictures of the object we allow any finite sequence of pictures of pictures to be taken as well.

The proof has a trivial and a less trivial component. The trivial part is to show that an affine motion of the camera frame can be decomposed into a rigid motion followed by some arbitrary collineation in $P^2$. The less trivial component is to show that any collineation in $P^2$ can be created by a finite sequence of views of a view where only rigid motion of the camera frame is allowed. The details can be found in Appendix A.

The next section treats the projective case. It will be shown that this involves looking for invariants that remain fixed when any pair of views is chosen. The section may be skipped if the reader wishes to get to Part II of the paper; only results of affine-depth are used there.

4  Projective  Structure  and  Invariant 
From  Two  Perspective  Views 

Affine depth required the construction of a single reference plane, and for that reason it was necessary to require that one view remained fixed to serve as a reference view. To permit an invariant from any pair of views we should, by inference, design the construction such that the invariant be defined relative to two planes. By analogy, we will call the invariant "projective depth" [29]. This is done as follows.

We assign the coordinates $(1, 0, 0, 0)$, $(0, 1, 0, 0)$ and $(0, 0, 1, 0)$ to $P_1, P_2, P_3$, respectively. The coordinates $(0, 0, 0, 1)$ are assigned to a fourth point $P_4$, and the coordinates $(1, 1, 1, 1)$ to the COP of the first camera $O$ (see Figure 3). The plane passing through $P_1, P_2, P_3$ is denoted by $\pi_1$ (as before), and the plane passing through $P_1, P_3, P_4$ is denoted by $\pi_2$. Note that the line $OP_4$ intersects $\pi_1$ at $(1, 1, 1, 0)$, and the line $OP_2$ intersects $\pi_2$ at $(1, 0, 1, 1)$.

As before, let $A_1$ be the collineation from the image plane to $\pi_1$ satisfying $A_1 p_j \cong e_j$, $j = 1, \ldots, 4$, where $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 = (0, 0, 1)$ and $e_4 = (1, 1, 1)$. Similarly, let $E_1$ be the collineation from the image plane to $\pi_2$ satisfying $E_1 p_1 \cong e_1$, $E_1 p_2 \cong e_4$, $E_1 p_3 \cong e_2$ and $E_1 p_4 \cong e_3$. Note that if $A_1 p \cong (\alpha, \beta, \gamma)$ then $E_1 p \cong (\beta - \alpha, \beta - \gamma, \beta)$. We have, therefore, that the intersection of the line $OP$ with $\pi_1$ is the point $P_{\pi_1} = (\alpha, \beta, \gamma, 0)$, and the intersection with $\pi_2$ is the point $P_{\pi_2} = (\beta - \alpha, 0, \beta - \gamma, \beta)$. We can express $P$ and $O$ as a linear combination of those points:

Consider the cross ratio $k/k'$ of the four points $O, P_{\pi_1}, P_{\pi_2}, P$. Note that $k' = 1$ independently of $P$, therefore the cross ratio is simply $k$. As in the affine case, $k$ is invariant up to a uniform scale, and any fifth object point $P_o$ (not lying on any face of the tetrahedron $P_1, P_2, P_3, P_4$) can be assigned $k_o = 1$ by choosing the appropriate scale for $A_1$ (or $E_1$). This has the effect of mapping the fifth point $P_o$ onto the COP ($P_o \cong (1, 1, 1, 1)$). We have, therefore, that $k$ (normalized) is a projective invariant, which we call "projective depth". Relative shape is described as the ratio of a point from two planes, defined by four object points, along the line to a fifth point, which is also the center of projection, that is set up such that its ratio from the two planes is of unit value. Any transformation $T \in GL_4$ will leave the ratio $k$ invariant. What remains is to show how $k$ can be computed given a second view.

Let $A$ be the collineation between the two image planes due to $\pi_1$, i.e., $Ap_j \cong p'_j$, $j = 1, 2, 3$, and $Av \cong v'$, where $v, v'$ are the epipoles. Similarly, let $E$ be the collineation due to $\pi_2$, i.e., $Ep_j \cong p'_j$, $j = 1, 3, 4$, and $Ev \cong v'$. Note that three corresponding points and the corresponding epipoles are sufficient for computing the collineation due to the plane projecting onto the three points in both views; this is clear from the derivation in Section 3, but also can be found in [28, 29, 23]. We have that the projections of $P_{\pi_1}$ and $P_{\pi_2}$ onto the second image are captured by $Ap$ and $Ep$, respectively. Therefore, the cross ratio of $O, P_{\pi_1}, P_{\pi_2}, P$ is equal to the cross ratio of $v', Ap, Ep, p'$, which is computed as follows:

$$p' \cong Ap - sEp,$$

$$v' \cong Ap - s'Ep,$$

then $k = s/s'$, up to a uniform scale factor (which is set using a fifth point). Here we can also show that $s'$ is a constant independent of $p$. There is more than one way to show that; a simple way is as follows. Let $q$ be an arbitrary point in the first image. Then,

$$v' \cong Aq - s'_q Eq.$$

Let $H$ be a matrix defined by $H = A - s'_q E$. Then, $v' \cong Hv$ and $v' \cong Hq$. This could happen only if $v' \cong Hp$ for all $p$, and $s' = s'_q$. We have arrived at a very simple algorithm for recovering a projective invariant from two perspective (orthographic) views:

$$p' \cong Ap - kEp, \qquad (2)$$


where $A$ and $E$ are described above, and $k$ is invariant up to a uniform scale, which can be set by observing a fifth correspondence $p_o, p'_o$, i.e., set the scale of $E$ to satisfy $p'_o \cong Ap_o - Ep_o$. Unlike the affine case, $k$ is invariant for any two views from the set of all possible views. Note that $k$ need not be normalized using a fifth point if the first view remains fixed (we are back to the affine case). We have arrived at the following result, which is a refinement on the general result made in [4] that five corresponding points and the corresponding epipoles are sufficient for reconstruction up to a collineation in $P^3$:

In case the locations of the epipoles are known, then four corresponding points, coming from four non-coplanar points in space, are sufficient for computing the projective structure, up to a uniform but unknown scale, for all other points in space projecting onto corresponding points in both views. A fifth corresponding point, coming from a point in general position with the other four points, can be used to set the scale.

We have also:

Projective shape can be described as the ratio of a point $P$ from two faces of the tetrahedron, normalized by the ratio of a fixed point (the unit point of the reference frame) from those faces.
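The two steps of this result (setting the scale of $E$ with a fifth correspondence, then reading off $k$ for any other point from $p' \cong Ap - kEp$) can be sketched as follows. This is an illustrative sketch only: the names `raw_ratio`, `normalise_E`, `projective_depth` and `project` are hypothetical, the matrices are made-up numbers chosen to exercise the algebra rather than collineations of a real camera pair, and $A$ and $E$ are assumed to have been recovered beforehand.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def raw_ratio(A, E, p, p_prime):
    """Scalar s with p' ~ A p - s E p, from p' x (A p) = s (p' x (E p))."""
    lhs = cross(p_prime, matvec(A, p))
    rhs = cross(p_prime, matvec(E, p))
    return dot(lhs, rhs) / dot(rhs, rhs)

def normalise_E(A, E, p_o, p_o_prime):
    """Rescale E so that the fifth correspondence satisfies p'_o ~ A p_o - E p_o."""
    s_o = raw_ratio(A, E, p_o, p_o_prime)
    return [[s_o * e for e in row] for row in E]

def projective_depth(A, E_scaled, p, p_prime):
    """k in p' ~ A p - k E p, with E already scaled by the fifth point."""
    return raw_ratio(A, E_scaled, p, p_prime)

def project(A, E, p, k):
    """Synthesise p' = (x', y', 1) from Equation 2 (used only to build test data)."""
    q = [a - k * e for a, e in zip(matvec(A, p), matvec(E, p))]
    return [c / q[2] for c in q]

# Made-up data: E_given differs from E_true by an unknown scale factor.
A = [[1.0, 0.1, 0.2], [0.0, 0.9, -0.3], [0.05, 0.0, 1.0]]
E_true = [[0.8, -0.2, 0.1], [0.3, 1.1, 0.0], [0.0, 0.1, 0.9]]
E_given = [[e / 2.5 for e in row] for row in E_true]
p_o, p = [0.2, -0.4, 1.0], [0.5, 0.25, 1.0]
p_o_prime = project(A, E_true, p_o, 1.0)   # fifth point, k_o = 1
p_prime = project(A, E_true, p, 0.6)       # another point, k = 0.6

E_scaled = normalise_E(A, E_given, p_o, p_o_prime)
k = projective_depth(A, E_scaled, p, p_prime)
```

The sketch rescales $E$ rather than $A$, following the choice made in the text; rescaling $A$ instead would give the same normalized $k$.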

The practical implication of this derivation is that a projective invariant, such as the one described here, is worthwhile computing for tasks for which we do not have a fixed reference view available. We say worthwhile because projective depth comes at a cost: it requires an additional corresponding point and slightly more computation (recovering the matrix $E$ in addition to $A$). Such a task, for example, is to update the reconstructed structure from a moving stereo rig. At each time instant we are given a pair of views from which projective depth can be computed (projective coordinates follow trivially), and since both cameras are changing their position from one time instant to the next, we cannot rely on an affine invariant.

5  Summary  of  Part  I 

Given a view $\psi_o$ with image points $p$, there exists an affine invariant $k$ between $\psi_o$ and any other view $\psi_i$, with corresponding image points $p'$, satisfying the following equation:

$$\rho p' = Ap - kv',$$

where $A$ is the collineation between the two image planes due to the projection of some plane $\pi_1$ projecting to both views, and $v'$ is the epipole scaled such that $\rho_o p'_o = Ap_o - v'$ for some point $p_o$. The set of all views $S_\pi$ for which the camera's center is not on $\pi_1$ will satisfy the equation above against $\psi_o$. The view $\psi_o$ is a reference view.

A projective invariant $k$ is defined between any two views $\psi_i$ and $\psi_j$, again for the sake of not introducing new notations, projecting onto corresponding points $p$ and $p'$, respectively. The invariant satisfies the following equation:

$$\rho p' = Ap - kEp,$$

where $A$ is the collineation due to some plane $\pi_1$, and $E$ is the collineation due to some other plane $\pi_2$ scaled such that $\rho_o p'_o = Ap_o - Ep_o$, for some point $p_o$.

Part  II 

6  Algebraic  Functions  of  Views 

In this part of the paper we use the results established in Section 3 to derive results of a different nature: instead of reconstruction of shape and invariants we would like to establish a direct connection between views expressed as functions of image coordinates alone, which we will call "algebraic functions of views". With these functions one can manipulate views of an object, such as create new views, without the need to recover shape or camera geometry as an intermediate step; all that is needed is to appropriately combine the image coordinates of two reference views.

Algebraic functions of two views include the expression

$$p'^{T} F p = 0, \qquad (3)$$

where $F$ is known as the "Fundamental" matrix (cf. [4]) (a projective version of the well known "Essential" matrix of [16]), and the expression

$$\alpha_1 x' + \alpha_2 y' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0 \qquad (4)$$

due to [10], which is derived for orthographic views. These functions express the epipolar geometry between the two views in the perspective and orthographic cases, respectively. Algebraic functions of three views were introduced in the past only for orthographic views [31, 21]. For example,

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

These functions express a relationship between the image coordinates of one view as a function of image coordinates of two other views. In the example above, the $x$ coordinate in the third view, $x''$, is expressed as a linear function of image coordinates in two other views; similar expressions exist for $y''$.

We will use the affine-depth invariant result to derive algebraic functions of three perspective views. The relationship between a perspective view and two other perspective views is shown to be trilinear in image coordinates across the three views. The relationship is shown to be bilinear if two of the views are orthographic, a special case useful for recognition tasks. We will start by addressing the two-view case. We will use Equation 1 to relate the entries of the camera transformation $A$ and $v'$ (of Equation 1) to the fundamental matrix by showing that $F = v' \times A$. This also has an advantage of introducing an alternative way of deriving expressions 3 and 4, a way that also puts them both under a single framework.

6.1  Algebraic  Functions  of  Two  Views 

Consider Equation 1, reproduced below:

$$\rho p' = Ap - kv'.$$

By simple manipulation of this equation we obtain:

$$k = \frac{a_1 \cdot p - x' a_3 \cdot p}{v'_1 - x' v'_3} = \frac{a_2 \cdot p - y' a_3 \cdot p}{v'_2 - y' v'_3} = \frac{x' a_2 \cdot p - y' a_1 \cdot p}{x' v'_2 - y' v'_1}, \qquad (5)$$

where $a_1, a_2, a_3$ are the row vectors of $A$ and $v' = (v'_1, v'_2, v'_3)$. After equating the first two terms, we obtain:

$$x'(v'_2 a_3 \cdot p - v'_3 a_2 \cdot p) + y'(v'_3 a_1 \cdot p - v'_1 a_3 \cdot p) + (v'_1 a_2 \cdot p - v'_2 a_1 \cdot p) = 0. \qquad (6)$$

Note that the terms within parentheses are linear polynomials in $x, y$ with fixed coefficients (i.e., depend only on $A$ and $v'$). Also note that we get the same expression when equating the first and third, or the second and third terms of Equation 5. This leads to the following result:


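The image coordinates $(x, y)$ and $(x', y')$ of two corresponding points across two perspective views satisfy a unique equation of the following form:

$$x'(\alpha_1 x + \alpha_2 y + \alpha_3) + y'(\alpha_4 x + \alpha_5 y + \alpha_6) + \alpha_7 x + \alpha_8 y + \alpha_9 = 0, \qquad (7)$$

where the coefficients $\alpha_j$, $j = 1, \ldots, 9$, have a fixed relation to the camera transformation $A$ and $v'$ of Equation 1:

$$\begin{aligned}
\alpha_1 &= v'_2 a_{31} - v'_3 a_{21}, & \alpha_2 &= v'_2 a_{32} - v'_3 a_{22}, & \alpha_3 &= v'_2 a_{33} - v'_3 a_{23}, \\
\alpha_4 &= v'_3 a_{11} - v'_1 a_{31}, & \alpha_5 &= v'_3 a_{12} - v'_1 a_{32}, & \alpha_6 &= v'_3 a_{13} - v'_1 a_{33}, \\
\alpha_7 &= v'_1 a_{21} - v'_2 a_{11}, & \alpha_8 &= v'_1 a_{22} - v'_2 a_{12}, & \alpha_9 &= v'_1 a_{23} - v'_2 a_{13}.
\end{aligned}$$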

Equation 7 can also be written as $p'^{T} F p = 0$, where the entries of the matrix $F$ are the coefficients $\alpha_j$, and therefore $F = v' \times A$. We have, thus, obtained a new and simple relationship between the elements of the "fundamental" matrix $F$ and the elements of the camera transformation $A$ and $v'$. It is worth noting that this result can be derived much more easily, as follows. First, the relationship $p'^{T} F p = 0$ can be derived, as observed by [4], from the fact that $F$ is a correlation mapping points $p$ onto their corresponding epipolar lines $l'$ in the second image, and therefore $p' \cdot l' = 0$. Second¹, since $l' \cong v' \times Ap$, we have $F = v' \times A$. It is known that the rank of the fundamental matrix is 2; we can use this relationship to show that as well:


$$F = v' \times A = \begin{bmatrix} v'_2 a_3 - v'_3 a_2 \\ v'_3 a_1 - v'_1 a_3 \\ v'_1 a_2 - v'_2 a_1 \end{bmatrix},$$

where $a_1, a_2, a_3$ are the row vectors of $A$. Let $f_1, f_2, f_3$ be the row vectors of $F$, then it is easy to verify that

$$f_3 = \alpha f_1 + \beta f_2$$

by setting $\alpha = -v'_1 / v'_3$ and $\beta = -v'_2 / v'_3$.

¹This was a comment made by Tuan Luong.
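The relationship $F = v' \times A$ and the rank argument can be checked numerically. The sketch below is illustrative only: the function name `fundamental_from` and all numbers are made up, and $F$ is formed as the skew-symmetric matrix of $v'$ times $A$, which is one standard way of writing a cross product with every column of a matrix. It verifies that points generated by Equation 1 satisfy $p'^{T} F p = 0$ for any $k$, and that $v'^{T} F = 0$, which is the linear dependence between the rows of $F$.

```python
def skew(v):
    """Matrix [v]_x such that [v]_x u equals the cross product v x u."""
    return [[0.0, -v[2], v[1]],
            [v[2], 0.0, -v[0]],
            [-v[1], v[0], 0.0]]

def matmul(X, Y):
    return [[sum(X[i][m] * Y[m][j] for m in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fundamental_from(A, v_prime):
    """F = v' x A; row by row this is (v'_2 a_3 - v'_3 a_2, v'_3 a_1 - v'_1 a_3, v'_1 a_2 - v'_2 a_1)."""
    return matmul(skew(v_prime), A)

# Made-up camera transformation and epipole.
A = [[1.0, 0.1, 0.2], [0.0, 0.9, -0.3], [0.05, 0.0, 1.0]]
v_prime = [0.4, -0.2, 1.0]
p = [0.5, 0.25, 1.0]
F = fundamental_from(A, v_prime)

# Epipolar constraint for points generated by Equation 1 at several depths k.
residuals = []
for k in (0.0, 0.7, -1.3):
    p_prime = [a - k * b for a, b in zip(matvec(A, p), v_prime)]
    residuals.append(dot(p_prime, matvec(F, p)))

# v'^T F = 0: the rows of F are linearly dependent, so rank(F) <= 2.
left_null = [sum(v_prime[i] * F[i][j] for i in range(3)) for j in range(3)]
```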

Next, we can use the result $F = v' \times A$ to show how the orthographic case, treated by [10], fits this relationship. In the framework of Equation 1, we saw that with orthographic views we have $A$ being affine in $P^2$, i.e., $a_3 \cdot p = 1$, and $v'_3 = 0$. After substitution in Equation 6, we obtain the equation:

$$\alpha_1 x' + \alpha_2 y' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0, \qquad (8)$$

where the coefficients $\alpha_j$, $j = 1, \ldots, 5$, have the following values:

$$\begin{aligned}
\alpha_1 &= v'_2, \\
\alpha_2 &= -v'_1, \\
\alpha_3 &= v'_1 a_{21} - v'_2 a_{11}, \\
\alpha_4 &= v'_1 a_{22} - v'_2 a_{12}, \\
\alpha_5 &= v'_1 a_{23} - v'_2 a_{13}.
\end{aligned}$$

These coefficients are also the entries of the fundamental matrix, which can also be derived from $F = v' \times A$ by setting $v'_3 = 0$ and $a_3 = (0, 0, 1)$.

The algebraic function 7 can be used for re-projection onto a third view, by simply noting that the function between view 1 and 3, and the function between view 2 and 3, provide two equations for solving for $(x'', y'')$. This was proposed in the past, in various forms, by [20, 3, 19]. Since the algebraic function expresses the epipolar geometry between the two views, however, a solution can be found only if the COPs of the three cameras are non-collinear (cf. [28, 27]), which can lead to numerical instability unless the COPs are far from collinear. The alternative, as shown next, is to derive directly algebraic functions of three views. In that case, the coordinates $(x'', y'')$ are solved for separately, each from a single equation, without problems of singularities.

6.2  Algebraic  Functions  of  Three  Views 

Consider Equation 1 applied between view 1 and 2, and between view 1 and 3:

$$\rho p' = Ap - kv'$$

$$\sigma p'' = Bp - kv''. \qquad (9)$$

Here we make use of the result that affine-depth $k$ is invariant for any view in reference to the first view. We can isolate $k$ again from Equation 9 and obtain:

$$k = \frac{b_1 \cdot p - x'' b_3 \cdot p}{v''_1 - x'' v''_3} = \frac{b_2 \cdot p - y'' b_3 \cdot p}{v''_2 - y'' v''_3} = \frac{x'' b_2 \cdot p - y'' b_1 \cdot p}{x'' v''_2 - y'' v''_1}, \qquad (10)$$

where $b_1, b_2, b_3$ are the row vectors of $B$ and $v'' = (v''_1, v''_2, v''_3)$. Because of the invariance of $k$ we can equate terms of Equation 5 with terms of Equation 10 and obtain trilinear functions of image coordinates across three views. For example, by equating the first term of each of the two equations, we obtain:

$$x''(v'_1 b_3 \cdot p - v''_3 a_1 \cdot p) + x''x'(v''_3 a_3 \cdot p - v'_3 b_3 \cdot p) + x'(v'_3 b_1 \cdot p - v''_1 a_3 \cdot p) + (v''_1 a_1 \cdot p - v'_1 b_1 \cdot p) = 0. \qquad (11)$$

This leads to the following result:

The image coordinates $(x, y)$, $(x', y')$ and $(x'', y'')$ of three corresponding points across three perspective views satisfy a trilinear equation of the following form:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x''x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0, \qquad (12)$$

where the coefficients $\alpha_j$, $j = 1, \ldots, 12$, have a fixed relation to the camera transformations between the first view and the other two views.

Note that the $x$ coordinate in the third view, $x''$, is obtained as a solution of a single equation in coordinates of the other two views. The coefficients $\alpha_j$ can be recovered as a solution of a linear system, directly if we observe 11 corresponding points across the three views (more than 11 points can be used for a least-squares solution), or with fewer points by first recovering the elements of the camera transforms as described in Section 3. Then, for any additional point $(x, y)$ whose correspondence in the second image is known $(x', y')$, we can recover the corresponding $x$ coordinate, $x''$, in the third view by substitution in Equation 12.
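The substitution step can be sketched for the second route, where the camera transformations are already known. This is a hedged illustration, not the paper's code: the function name `reproject_x`, the argument names (`v1` for $v'$, `v2` for $v''$, `x1` for $x'$) and all numeric values are assumptions made for the example. The four grouped terms are those of Equation 11, which is linear in $x''$ once $(x, y)$ and $x'$ are fixed.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def reproject_x(A, v1, B, v2, p, x1):
    """Solve Equation 11 for x'' given p = (x, y, 1) and x' (here x1)."""
    a1p, a3p = dot(A[0], p), dot(A[2], p)
    b1p, b3p = dot(B[0], p), dot(B[2], p)
    c_x2 = v1[0] * b3p - v2[2] * a1p      # multiplies x''
    c_x2x1 = v2[2] * a3p - v1[2] * b3p    # multiplies x'' x'
    c_x1 = v1[2] * b1p - v2[0] * a3p      # multiplies x'
    c_0 = v2[0] * a1p - v1[0] * b1p       # constant term
    denom = c_x2 + x1 * c_x2x1
    if denom == 0.0:
        raise ValueError("singular configuration for this equation")
    return -(x1 * c_x1 + c_0) / denom

# Made-up transformations for views 1->2 (A, v1) and 1->3 (B, v2).
A = [[1.0, 0.1, 0.2], [0.0, 0.9, -0.3], [0.05, 0.0, 1.0]]
v1 = [0.4, -0.2, 1.0]
B = [[0.9, -0.2, 0.3], [0.1, 1.0, 0.1], [0.0, 0.05, 1.0]]
v2 = [-0.3, 0.5, 1.0]
p, k = [0.5, 0.25, 1.0], 0.7

# Ground truth from Equation 9, sharing the same affine depth k.
q1 = [a - k * b for a, b in zip(matvec(A, p), v1)]
q2 = [a - k * b for a, b in zip(matvec(B, p), v2)]
x1_true = q1[0] / q1[2]
x2_true = q2[0] / q2[2]

x2_pred = reproject_x(A, v1, B, v2, p, x1_true)
```

Note that neither $k$ nor $y'$ is needed to predict $x''$; only $(x, y)$ and $x'$ enter the equation.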

In a similar fashion, after equating the first term of Equation 5 with the second term of Equation 10, we obtain an equation for $y''$ as a function of the two other views:

$$y''(\beta_1 x + \beta_2 y + \beta_3) + y''x'(\beta_4 x + \beta_5 y + \beta_6) + x'(\beta_7 x + \beta_8 y + \beta_9) + \beta_{10} x + \beta_{11} y + \beta_{12} = 0. \qquad (13)$$

Taken together, Equations 5 and 10 lead to 9 algebraic functions of three views, six of which are separate for $x''$ and $y''$. The other four functions are listed below:

$$x''(\cdot) + x''y'(\cdot) + y'(\cdot) + (\cdot) = 0, \qquad (14)$$

$$y''(\cdot) + y''y'(\cdot) + y'(\cdot) + (\cdot) = 0, \qquad (15)$$

$$x''x'(\cdot) + x''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0, \qquad (16)$$

$$y''x'(\cdot) + y''y'(\cdot) + x'(\cdot) + y'(\cdot) = 0, \qquad (17)$$

where $(\cdot)$ represent linear polynomials in $x, y$. The solution for $x'', y''$ is unique without constraints on the allowed camera transformations. If we choose Equations 12 and 13, then $v'_1$ and $v'_3$ should not vanish simultaneously, i.e., $v' \cong (0, 1, 0)$ is a singular case. Also $v'' = (0, 1, 0)$ and $v'' = (1, 0, 0)$ give rise to singular cases. One can easily show that for each singular case there are two other functions out of the nine available ones that provide a unique solution for $x'', y''$. Note that the singular cases are pointwise, i.e., only three epipolar directions are excluded, compared to the much stronger singular case when the algebraic function of two views is used separately, as described in the previous section.

Taken together, the process of generating a novel view can be easily accomplished without the need to explicitly recover structure (affine depth), camera transformation (matrices $A$, $B$ and epipoles $v'$, $v''$) or epipolar geometry (just the epipoles or the Fundamental matrix), for the price of using more than the minimal number of points that are required otherwise (the minimal is six between the two model views and the novel third view).

The connection between the general result of trilinear functions of views and the "linear combination of views" result [31] for orthographic views can easily be seen by setting $A$ and $B$ to be affine in $P^2$, and $v'_3 = v''_3 = 0$. For example, Equation 11 reduces to:

$$v'_1 x'' - v''_1 x' + (v''_1 a_1 \cdot p - v'_1 b_1 \cdot p) = 0, \qquad (18)$$

which is of the form:

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

In the case where all three views are orthographic, then $x''$ is expressed as a linear combination of image coordinates of the two other views, as discovered by [31].

In the next section we address another case, intermediate between the general trilinear and the orthographic linear functions, which we find interesting for applications of visual recognition.

6.2.1 Recognition of Perspective Views From
an Orthographic Model

Consider the case for which the two reference (model) views of an object are taken orthographically (using a tele lens would provide a reasonable approximation), but during recognition any perspective view of the object is allowed. It can easily be shown that the three views are then connected via a bilinear function (instead of trilinear): $A$ is affine in $P^2$ and $v'_3 = 0$, therefore Equation 11 reduces to:

$$x''(v'_1 b_3 \cdot p - v''_3 a_1 \cdot p) + v''_3 x''x' - v''_1 x' + (v''_1 a_1 \cdot p - v'_1 b_1 \cdot p) = 0,$$

which is of the following form:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x''x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0. \qquad (19)$$

Similarly, Equation 13 reduces to

$$y''(\beta_1 x + \beta_2 y + \beta_3) + \beta_4 y''x' + \beta_5 x' + \beta_6 x + \beta_7 y + \beta_8 = 0. \qquad (20)$$

A bilinear function of three views has two advantages over the general trilinear function. First, only seven corresponding points (instead of 11) across three views are required for solving for the coefficients (compared to the minimal six if we first recover $A, B, v', v''$). Second, the lower the degree of the algebraic function, the less sensitive the solution should be in the presence of errors in measuring correspondences. In other words, it is likely (though not necessary) that the higher order terms, such as the term $x''x'x$ in Equation 12, will have a higher contribution to the overall error sensitivity of the system.
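The drop from twelve to eight coefficients can be made concrete by expanding Equation 11 into the coefficient vector of Equation 12. The sketch below is an illustration under assumptions: the names `trilinear_coeffs` and `evaluate` are hypothetical, the cameras are made-up numbers, and the orthographic model is imposed by giving $A$ the third row $(0, 0, 1)$ and setting $v'_3 = 0$. In that case the $x$ and $y$ coefficients of the $x''x'$ and $x'$ polynomials vanish, leaving the eight coefficients of Equation 19.

```python
def trilinear_coeffs(A, v1, B, v2):
    """Coefficients alpha_1..alpha_12 of Equation 12, expanded from Equation 11.

    v1 stands for v' and v2 for v''; each group of three is a linear
    polynomial in (x, y, 1).
    """
    a1, a3 = A[0], A[2]
    b1, b3 = B[0], B[2]
    g1 = [v1[0] * b - v2[2] * a for a, b in zip(a1, b3)]   # multiplies x''
    g2 = [v2[2] * a - v1[2] * b for a, b in zip(a3, b3)]   # multiplies x'' x'
    g3 = [v1[2] * b - v2[0] * a for a, b in zip(a3, b1)]   # multiplies x'
    g4 = [v2[0] * a - v1[0] * b for a, b in zip(a1, b1)]   # free polynomial
    return g1 + g2 + g3 + g4

def evaluate(alpha, x, y, x1, x2):
    """Left-hand side of Equation 12 for a triplet (x, y), x' = x1, x'' = x2."""
    lin = lambda i: alpha[i] * x + alpha[i + 1] * y + alpha[i + 2]
    return x2 * lin(0) + x2 * x1 * lin(3) + x1 * lin(6) + lin(9)

# Orthographic model pair (A affine, v'_3 = 0), perspective third view.
A_aff = [[1.0, 0.1, 0.2], [0.0, 0.9, -0.3], [0.0, 0.0, 1.0]]
v1 = [0.4, -0.2, 0.0]
B = [[0.9, -0.2, 0.3], [0.1, 1.0, 0.1], [0.0, 0.05, 1.0]]
v2 = [-0.3, 0.5, 1.0]

alpha = trilinear_coeffs(A_aff, v1, B, v2)
zero_idx = [i for i, c in enumerate(alpha) if c == 0]

# One synthetic correspondence generated with affine depth k = 0.7.
x, y, k = 0.5, 0.25, 0.7
p = [x, y, 1.0]
q1 = [sum(r * c for r, c in zip(row, p)) - k * v for row, v in zip(A_aff, v1)]
q2 = [sum(r * c for r, c in zip(row, p)) - k * v for row, v in zip(B, v2)]
residual = evaluate(alpha, x, y, q1[0] / q1[2], q2[0] / q2[2])
```

With general perspective cameras all twelve entries are generically non-zero, which is why 11 points are needed in the trilinear case.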

Compared to the case when all views are assumed orthographic, this case is much less of an approximation. Since the model views are taken only once, it is not unreasonable to require that they be taken in a special way, namely, with a tele lens (assuming we are dealing with object recognition, rather than scene recognition). If that requirement is satisfied, then the recognition task is general since we allow any perspective view to be taken during the recognition process.

7 Applications

Algebraic functions of views allow the manipulation of images of objects without necessarily recovering 3D structure or any form of camera geometry (either full, or just the epipoles).

One application that was emphasized throughout the paper is visual recognition via alignment. In this context, the general result of a trilinear relationship between views is not encouraging. If we want to avoid implicating structure and camera geometry, we must have 11 corresponding points across the three views, compared to six points otherwise. In practice, however, we would need more than the minimal number of points in order to obtain a least squares solution. The question is whether the simplicity of the method using trilinear functions translates also to increased robustness in practice when many points are used; this is an open question.

Still in the context of recognition, the existence of bilinear functions in the special case where the model is orthographic, but the novel view is perspective, is more encouraging. Here we have the result that only seven corresponding points are required to obtain recognition of perspective views (provided we can satisfy the requirement that the model is orthographic) compared to six points when structure and camera geometry are recovered. The additional corresponding pair of points may be indeed worth the greater simplicity that comes with working with algebraic functions.

There may exist other applications where simplicity is of major importance, whereas the number of points is less of a concern. Consider, for example, the application of model-based compression. With the trilinear functions we need 22 parameters to represent a view as a function of two reference views in full correspondence. Assume both the sender and the receiver have the two reference views and apply the same algorithm for obtaining correspondences between the two views. To send a third view (ignoring problems of self occlusions that could be dealt with separately) the sender can solve for the 22 parameters using many points, but eventually send only the 22 parameters. The receiver then simply combines the two reference views in a "trilinear way" given the received parameters. This is clearly a domain where the number of points is not a major concern, whereas simplicity, and probably robustness due to the short-cut in the computations, is of great importance.
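The receiver's side of this scheme amounts to solving Equations 12 and 13 point by point. The following is a minimal sketch, assuming the received parameters are stored as two 12-element coefficient vectors (each defined only up to scale, hence 11 + 11 = 22 parameters); the names `solve_third_coord` and `render_point` and the toy coefficient values are hypothetical and chosen only so that the result is easy to check by hand.

```python
def solve_third_coord(c, x, y, x1):
    """Solve t*(c1 x + c2 y + c3) + t*x1*(c4 x + c5 y + c6)
    + x1*(c7 x + c8 y + c9) + c10 x + c11 y + c12 = 0 for t.

    With c = alpha this gives x'' (Equation 12); with c = beta it gives
    y'' (Equation 13). x1 is the x coordinate in the second reference view.
    """
    lin = lambda i: c[i] * x + c[i + 1] * y + c[i + 2]
    denom = lin(0) + x1 * lin(3)
    if denom == 0.0:
        raise ValueError("singular configuration for this equation")
    return -(x1 * lin(6) + lin(9)) / denom

def render_point(alpha, beta, x, y, x1):
    """Position (x'', y'') of a matched model point in the transmitted view."""
    return solve_third_coord(alpha, x, y, x1), solve_third_coord(beta, x, y, x1)

# Toy coefficients: alpha encodes x'' - x' = 0, beta encodes y'' - y = 0.
alpha = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0]
beta = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0]
x2, y2 = render_point(alpha, beta, 2.0, 3.0, 5.0)
```

Applying `render_point` to every matched pixel pair of the two reference views would produce the third view, occlusions aside.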

Related to image coding is a recent approach of image decomposition into "layers" as proposed in [1, 2]. In this approach, a sequence of views is divided up into regions, where the motion of each is described approximately by a 2D affine transformation. The sender sends the first image followed only by the six affine parameters for each region for each subsequent frame. The use of algebraic functions of views can potentially make this approach more powerful because instead of dividing up the scene into planes (it would have been planes if the projection was parallel; in general it is not even planes) one can attempt to divide the scene into objects, each carrying the 22 parameters describing its displacement onto the subsequent frame.

Another area of application may be in computer graphics. Re-projection techniques provide a short-cut for image rendering. Given two fully rendered views of some 3D object, other views (again ignoring self-occlusions) can be rendered by simply "combining" the reference views. Again, the number of corresponding points is less of a concern here.

8  Summary  of  Part  II 

The  derivation  of  an  affine  invariant  across  perspective 
views  in  Section  3  was  used  to  derive  algebraic  func¬ 
tions  of  image  coordinates  acro.ss  two  and  three  view.-. 
The.se  enable  the  generation  of  novel  views,  for  pur|>oses 
of  visual  recognition  and  for  other  applications,  without 
going  through  the  |)rocess  of  recovering  object  structure 
(metric  or  non-metric)  and  camera  geometry. 

Between  two  views  there  exists  a  unique  function 
wdiose  coefficients  are  the  elements  of  the  Fundamental 
matrix  and  were  shown  to  be  related  explicitly  to  the 
camera  transformation  .-1,  v': 

J’*(t>  1  J'  -f-  O-.V  "b  O3 )  .1/  ( O.J J'  -b  0 ,5 .(/  -t-  (»o )  -t- 

07J*  “|-  “b  03  —  0. 

The derivation was also useful in making the connection to a similar expression, due to [10], made in the context of orthographic views.

We have seen that trilinear functions of image coordinates exist across three views, one of them shown below:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x''x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0.$$
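As a small illustration of how such a function could be used for re-projection, the sketch below solves the trilinear function for x'' once the twelve coefficients are known. The function names and the coefficient values in the usage example are hypothetical, and the coefficients are assumed to have been recovered beforehand from point correspondences; this is a sketch of the algebra, not the procedure used in the paper.

```python
def reproject_x2(alpha, x, y, x1):
    """Solve the trilinear function for x'' (hypothetical helper).

    alpha : 12 coefficients alpha_1..alpha_12, assumed already recovered
    (x, y): point in the first reference view
    x1    : x' coordinate of the corresponding point in the second view
    """
    if len(alpha) != 12:
        raise ValueError("expected 12 coefficients")
    a = alpha
    # Terms that multiply x'': (a1 x + a2 y + a3) + x'(a4 x + a5 y + a6)
    denom = (a[0] * x + a[1] * y + a[2]) + x1 * (a[3] * x + a[4] * y + a[5])
    # Terms free of x'': x'(a7 x + a8 y + a9) + a10 x + a11 y + a12
    rest = x1 * (a[6] * x + a[7] * y + a[8]) + a[9] * x + a[10] * y + a[11]
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("x'' is not determined for this point")
    return -rest / denom


def trilinear_residual(alpha, x, y, x1, x2):
    """Left-hand side of the trilinear function; zero for a consistent triplet."""
    a = alpha
    return (x2 * (a[0] * x + a[1] * y + a[2])
            + x2 * x1 * (a[3] * x + a[4] * y + a[5])
            + x1 * (a[6] * x + a[7] * y + a[8])
            + a[9] * x + a[10] * y + a[11])


# Made-up coefficients, for illustration only.
alpha = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, -6]
x2 = reproject_x2(alpha, x=2.0, y=3.0, x1=1.0)
```

Because the function is linear in x'', each point needs only one division once the coefficients are fixed.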

In case two of the views are orthographic, a bilinear relationship across three views holds. For example, the trilinear function above reduces to:

$$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x''x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0.$$

In case all three views are orthographic, a linear relationship holds, as observed in [31]:

$$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0.$$

9  General  Discussion 

For purposes of visual recognition by alignment, the transformations induced by changing viewing positions are at most affine. In other words, a pin-hole uncalibrated camera is no more than an "affine engine" for tasks for which a reference view (a model) is available. One of the goals of this paper was to make this claim and make use of it in providing methods for affine reconstruction and for recognition.

An affine reconstruction follows immediately from Equation 1 and the realization that A is a collineation of some plane which is fixed for all views. The reconstructed homogeneous coordinates are (x, y, 1, k), where (x, y, 1) are the homogeneous coordinates of the image plane of the reference view, and k is an affine invariant. The invariance of k can be used to generate novel views of the object (which are all affinely related to the reference view), and thus achieve recognition via alignment. We can therefore distinguish between affine and non-affine transformations in the context of recognition: if the object is fixed and the transformations are induced by camera displacements, then k must be invariant, since the space of transformations is no more than affine. If, however, the object is allowed to transform as well, then k would not remain fixed if the transformation is not affine, i.e., involves more than translation, rotation, scaling and shearing. For example, we may apply a projective transformation in $\mathcal{P}^3$ to the object representation, i.e., map five points (in general position) to arbitrary locations in space (which still remain in general position) and map all other points accordingly. This mapping allows more "distortions" than affine transformations allow, and can be detected by the fact that k will not remain fixed.
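To make the role of k concrete, the following sketch recovers k for a single point from Equation 1, p' ≅ Ap - kv', by eliminating the unknown scale with a cross product. It assumes that A and v' are already known and mutually scaled (the paper uses a fourth point for that purpose); the function names and numbers are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def matvec(A, p):
    """Product of a 3x3 matrix (list of rows) with a 3-vector."""
    return tuple(dot(row, p) for row in A)


def affine_depth(A, v_prime, p, p_prime):
    """Solve p' ~ A p - k v' for k (hypothetical helper).

    Taking the cross product with p' removes the unknown scale:
        p' x (A p) = k (p' x v'),
    and k is then obtained by projecting onto p' x v'.
    """
    lhs = cross(p_prime, matvec(A, p))
    rhs = cross(p_prime, v_prime)
    denom = dot(rhs, rhs)
    if denom < 1e-12:
        raise ValueError("p' coincides with the epipole; k is undefined")
    return dot(lhs, rhs) / denom


# Made-up example: A is the identity, the epipole is (1, 0, 0), true k is 0.5.
A = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
v_prime = (1.0, 0.0, 0.0)
p = (2.0, 3.0, 1.0)
p_prime = (3.0, 6.0, 2.0)          # equals 2 * (A p - 0.5 v')
k = affine_depth(A, v_prime, p, p_prime)
reconstructed = (p[0], p[1], 1.0, k)   # the (x, y, 1, k) representation
```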

Another use of the affine derivations was expressed in Part II of this paper, by showing the existence of algebraic functions of views. We have seen that any view can be expressed as a trilinear function with two reference views in the general case, or as a bilinear function when the reference views are created by means of parallel projection. These functions provide alternative, much simpler, means for manipulating views of a scene. The camera geometries between one of the reference views and the other two views are folded into 22 coefficients. The number 22 is perfectly expected because these camera geometries can be represented by two camera transformation matrices, and we know that a camera transformation matrix has 11 free parameters (a 3 x 4 matrix, determined up to a scale factor). However, the folding of the camera transformations is done in such a way that we have two independent sets of 11 coefficients each, and each set contains foldings of elements of both camera transformation matrices (recall Equation 11). This enables us to recover the coefficients from point correspondences alone, ignoring the 3D structure of the scene. Because of their simplicity, we believe that these algebraic functions will find uses in tasks other than visual recognition; some of those are discussed in Section 7.

This paper is also about projective invariants, making the point of when we need to recover a projective invariant, what additional advantages we should expect, and what price is involved (more computations, more points, etc.). Before we discuss those issues, it is worth discussing a point or two related to the way affine-depth was derived. Results put aside, Equation 1 looks suspiciously similar to, or trivially derivable from, the classic motion equation between two frames. Also, there is the question of whether it was really necessary to use the tools of projective geometry for a result that is essentially affine. Finally, one may ask whether there are simpler derivations of the same result. Consider the classic motion equation for a calibrated camera:

$$z'p' = zRp + t.$$

Here R is an orthogonal matrix accounting for the rotational component of camera displacement, t is the translation component (note that $t \cong v'$), z is the depth from the first camera frame, and z' is the depth value seen from the second camera frame. Divide both sides of the equation by z, assume that R is an arbitrary non-singular matrix A, and it seems that we have arrived at Equation 1, where k = -1/z. In order to do it right, one must start with an affine frame, map it affinely onto the first camera, then map it affinely onto the second camera, and then relate the two mappings together; it will then become clear that k is an invariant measurement. This derivation, which we will call an "affine derivation", appears to have the advantage of not using projective geometry. However, there are some critical pieces missing. First, and foremost, we have an equation but not an algorithm. We have seen that simple equation counting for solving for A and k, given t, from point correspondences is not sufficient, because the system of equations is singular for any number of corresponding points. Also, equation counting does not reveal the fact that only four points are necessary: three for A and the fourth for setting a mutual scale. Therefore, the realization that A is a homography of some plane that is fixed along all views (a fact that is not revealed by the affine derivation) is crucial for obtaining an algorithm. Second, the nature of the invariant measurement k is not completely revealed; it is not (inverse) depth because A is not necessarily orthogonal, and all the other results described in Section 3.2 do not clearly follow either.
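The informal step discussed above, dividing the motion equation by z, can be written out as follows. This is only the naive manipulation that the text cautions against relying on; the identifications in the last line are the assumptions made in that argument, not a derivation of the invariance of k.

```latex
\begin{align*}
z'p' &= zRp + t \\
\frac{z'}{z}\,p' &= Rp + \frac{1}{z}\,t
  && \text{(divide both sides by } z\text{)} \\
p' &\cong Ap - kv'
  && \text{(taking } A := R,\ v' \cong t,\ k = -\tfrac{1}{z}\text{)}
\end{align*}
```

The scale factor z'/z is absorbed by the projective equality, which is why the result matches Equation 1 only up to scale.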

Consider next the question of whether, within the context of projective geometry, affine-depth could have been derived on geometric grounds without setting up coordinates, as we did. For example, although this was not mentioned in Section 3, it is clear that the three points p', Ap, v' are collinear; this is well known and can be derived from purely geometric considerations by observing that the optical line OP and the epipolar line p'v' are projectively related (cf. [28, 29, 22]). It is less obvious, however, to show on geometric grounds only that the ratio k is invariant independently of where the second view is located, because ratios are not generally preserved under projectivity (only cross-ratios are). In fact, as we saw, k is invariant only up to a uniform scale; therefore, for any particular optical line the ratio is not preserved. It is for this reason that algebra was introduced in Section 3 for the derivation of affine-depth.

Consider next the difference between the affine and the projective frameworks. We have seen that from a theoretical standpoint, a projective invariant, such as projective-depth k in Equation 2, is really necessary when a reference view is not available. For example, assume we have a sequence of n views $\psi_0, \psi_1, \ldots, \psi_{n-1}$ of a scene and we wish to recover its 3D structure. An affine framework would result if we choose one of the views, say $\psi_0$, as a reference view, and compute the structure as seen from that camera location given the correspondences $\psi_0 \Rightarrow \psi_i$ with all the remaining views; this is a common approach for recovering metric structure from a sequence. Because affine-depth is invariant, we have n - 1 occurrences of the same measurement k for every point, which can be used as a source of information for a least-squares solution for k (or naively, simply average the n - 1 measurements). Now consider the projective framework. Projective-depth k is invariant for any two views $\psi_i, \psi_j$ of the sequence. We have therefore n(n - 1) occurrences of k, which is clearly a stronger source of information for obtaining an over-determined solution. The conclusion from this example is that a projective framework has practical advantages over the affine, even in cases where an affine framework is theoretically sufficient. There are other practical considerations in favor of the projective framework. In the affine framework, the epipole v' plays a double role: first for computing the collineation A, and then for computing the affine-depth of all points of interest. In the projective framework, the epipoles are used only for computing the collineations A and E but not for computing k. This difference has a practical value, as one would probably like to have the epipoles play as little a role as possible because of the difficulty in recovering their location accurately in the presence of noise. In industrial applications, for example, one may be able to set up a frame of reference of two planes with four coplanar points on each of the planes. Then the collineations A and E can be computed without the need for the epipoles, and thus the entire algorithm, expressed in Equation 2, can proceed without recovering the epipoles at all.
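The remark about computing a collineation from four coplanar points, with no epipole involved, can be illustrated with the standard linear solution for a plane collineation from four correspondences. This is a minimal sketch, not the paper's procedure: it assumes no three of the points are collinear, it fixes the last matrix entry to 1 (so it fails when that entry is zero), and all names and coordinates are hypothetical.

```python
def solve_linear(M, b):
    """Gauss-Jordan elimination with partial pivoting for a square system."""
    n = len(b)
    aug = [[float(v) for v in M[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("singular system: points not in general position")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                for c in range(col, n + 1):
                    aug[r][c] -= f * aug[col][c]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def collineation_from_four_points(src, dst):
    """3x3 collineation H with dst ~ H src, from exactly four point pairs."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and similarly for v
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_collineation(H, pt):
    """Map a 2D point through H and return inhomogeneous coordinates."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


# Made-up coordinates of four coplanar reference points in two views.
src = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
dst = [(0.0, 0.0), (2.0, 0.0), (3.0, 2.0), (0.0, 1.0)]
H = collineation_from_four_points(src, dst)
```

With two such reference planes, the same routine would be run once per plane to obtain the two collineations.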
Appendix


A  Proof of Proposition


Proposition 1 Given an arbitrary view $\psi_0 \in S_\pi$ generated by a camera with COP at initial position O, then all other views $\psi \in S_\pi$ can be generated by a rigid motion of the camera frame from its initial position, if in addition to taking pictures of the object we allow any finite sequence of pictures of pictures to be taken as well.


Lemma 1 The set of views $S_\pi$ can be generated by a rigid camera motion, starting from some fixed initial position, followed by some collineation in $\mathcal{P}^2$.

Proof: We have shown that any view $\psi \in S_\pi$ can be generated by satisfying Equation 1, reproduced below:

$$p' \cong Ap - kv'.$$

Note that k = 0 for all $P \in \pi$. First, we transform the coordinate system to a camera-centered one by sending $\pi$ to infinity: let $M \in GL_4$ be defined as

$$M = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 \end{bmatrix}.$$
We have:

$$p' \cong Ap - kv' = [A, -v'] \begin{pmatrix} x \\ y \\ 1 \\ k \end{pmatrix} \cong \underbrace{[A, -v']\,M^{-1}}_{[S,\ u]} \begin{pmatrix} x_b \\ y_b \\ z_b \\ 1 \end{pmatrix} = S \begin{pmatrix} x_b \\ y_b \\ z_b \end{pmatrix} + u,$$

where $x_b = x/(x + y + 1 + k)$, $y_b = y/(x + y + 1 + k)$ and $z_b = 1/(x + y + 1 + k)$. Let R be a rotation matrix in 3D, i.e., $R \in GL_3$, det(R) = 1, and let B denote a collineation in $\mathcal{P}^2$, i.e., $B \in GL_3$, and let w be some vector in 3D. Then, we must show that

$$p' \cong BR \begin{pmatrix} x_b \\ y_b \\ z_b \end{pmatrix} + Bw.$$

For every R, B and w, there exist S and u that produce the same image, simply by setting S = BR and u = Bw. We must also show that for every S and u there exist R, B and w that produce the same image: since S is of full rank (because A is), the claim is true by simply setting $B = SR^T$ and $w = B^{-1}u$, for any arbitrary orthogonal matrix R. In conclusion, any view $\psi \in S_\pi$ can be generated by some rigid motion R, w starting from a fixed initial position, followed by some collineation B of the image plane. []
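The change of coordinates by M used in the proof of Lemma 1 can be checked numerically. The sketch below applies M to the representation (x, y, 1, k) and normalizes by the last coordinate, which gives the camera-centered coordinates $(x_b, y_b, z_b)$ defined in the proof; the function name and the sample values are hypothetical.

```python
# The matrix M from the proof of Lemma 1 (rows listed top to bottom).
M = [[1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [1, 1, 1, 1]]


def to_camera_centered(x, y, k):
    """Apply M to (x, y, 1, k) and normalize so the last coordinate is 1.

    Returns (x_b, y_b, z_b) with
        x_b = x / (x + y + 1 + k),
        y_b = y / (x + y + 1 + k),
        z_b = 1 / (x + y + 1 + k).
    """
    point = (x, y, 1.0, k)
    h = [sum(M[i][j] * point[j] for j in range(4)) for i in range(4)]
    if abs(h[3]) < 1e-12:
        raise ValueError("point is mapped to infinity (x + y + 1 + k = 0)")
    return tuple(c / h[3] for c in h[:3])


# Made-up point: x = 1, y = 2, k = 1, so the normalizer is 1 + 2 + 1 + 1 = 5.
xb, yb, zb = to_camera_centered(1.0, 2.0, 1.0)
```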

We need to show next that any collineation in $\mathcal{P}^2$ can be expressed by a finite sequence of views taken by a rigidly moving camera, i.e., a calibrated camera. It is worthwhile noting that the equivalence of projective transformations (an algebraic concept) with a finite sequence of projections of the plane onto itself (a geometric concept) is fundamental in projective geometry. For example, it is known that any projective transformation of the plane can be obtained as the resultant of a finite sequence of projections [32, Thm. 10, pp. 74]. The question, however, is whether the equivalence holds when projections are restricted to what is generally allowed in a rigidly moving camera model. In other words, in a sequence of projections of the plane, we are allowed to move the COP anywhere in $\mathcal{P}^3$; the image plane is allowed to rotate around the new location of the COP and scale its distance from it along a distinguishable axis (scaling focal length along the optical axis). What is not allowed, for example, is tilting the image plane with respect to the optical axis (that has the effect of changing the location of the principal point and the image scale factors, all of which should remain constant in a calibrated camera). Without loss of generality, the camera is set such that the optical axis is perpendicular to the image plane, and therefore when the COP is an ideal point the projecting rays are all perpendicular to the plane, i.e., the case of orthographic projection.


The equivalence between a sequence of perspective/orthographic views of a plane and projective transformations of the plane is shown by first reducing the problem to scaled orthographic projection by taking a sequence of two perspective projections, and then using a result of [30, 11] to show the equivalence for the scaled orthographic case. The following two auxiliary propositions are used:

Lemma 2 There is a unique projective transformation of the plane in which a given line u is mapped onto an ideal line (has no image in the real plane) and which maps non-collinear points A, B, C onto given non-collinear points A', B', C'.

Proof:  This  is  standard  material  (cf.  [7,  pp.  178]).  [] 

Lemma 3 There is a scaled orthographic projection for any given affine transformation of the plane.

Proof: follows directly from [30, 11], showing that any given affine transformation of the plane can be obtained by a unique (up to a reflection) 3D similarity transform of the plane followed by an orthographic projection. []

Lemma 4 There is a finite sequence of perspective and scaled orthographic views of the plane, taken by a calibrated camera, for any given projective transformation of the plane.

Proof: The proof follows and modifies [7, pp. 179]. We are given a plane $\alpha$ and a projective transformation T. If T is affine, then by Lemma 3 the proposition is true. If T is not affine, then there exists a line u in $\alpha$ that is mapped onto an ideal line under T. Let A, B, C be three non-collinear points which are not on u, and let their image under T be A', B', C'. Take a perspective view onto a plane $\alpha'$ such that u has no image in $\alpha'$ (the plane $\alpha'$ is rotated around the new COP such that the plane passing through the COP and u is parallel to $\alpha'$). Let $A_1, B_1, C_1$ be the images of A, B, C in $\alpha'$. Project $\alpha'$ back to $\alpha$ by orthographic projection, and let $A_2, B_2, C_2$ be the image of $A_1, B_1, C_1$ in $\alpha$. Let F be the resultant of these two projections in the stated order. Then F is a projective transformation of $\alpha$ onto itself such that u has no image (in the real plane) and A, B, C go into $A_2, B_2, C_2$. From Lemma 3 there is a viewpoint and a scaled orthographic projection of $\alpha$ onto $\alpha''$ such that $A_2, B_2, C_2$ go into A', B', C', respectively. Let L be the resultant of this projection (L is affine). $\hat{T} = FL$ is a projective transformation of $\alpha$ such that u has no image and A, B, C go into A', B', C'. By Lemma 2, $T = \hat{T}$ (projectively speaking, i.e., up to a scale factor). []

Proof of Proposition: follows directly from Lemma 1 and Lemma 4. []